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Abstract 

The rise of socially targeted marketing suggests that decisions made by consumers 
can be predicted not only from their personal tastes and characteristics, but also from 
the decisions of people who are close to them in their networks. One obstacle to consider 
is that there may be several different measures for "closeness" that are appropriate, ei- 
ther through different types of friendships, or different functions of distance on one kind 
of friendship, where only a subset of these networks may actually be relevant. Another 
is that these decisions are often binary and more difficult to model with conventional ap- 
proaches, both conceptually and computationally. To address these issues, we present a 
hierarchical model for individual binary outcomes that uses and extends the machinery 
of the auto-probit method for binary data. We demonstrate the behavior of the pa- 
rameters estimated by the multiple network-regime auto-probit model (m-NAP) under 
various sensitivity conditions, such as the impact of the prior distribution and the na- 
ture of the structure of the network, and demonstrate on several examples of correlated 
binary data in networks of interest to Information Systems, including the adoption of 
Caller Ring-Back Tones, whose use is governed by direct connection but explained by 
additional network topologies. 



1 Introduction 



The prevalence and widespread adoption of online social networks have made the analysis 
of these networks, particularly the behaviors of individuals embedded within, an important 



topic of study in information systems Agarwal et al. (2008); Oinas-Kukkonen et al. (2010) 



building off previous work in the context of technology diffusion Brancheau and Wetherbe 



(1990); Chatterjee and Eliashberg (1990); Premkumar et al. (1994). While past investiga- 



tions into behavior in networks were typically limited to hundreds of people, contemporary 
data collection and retrieval technologies enable easy access to network data on a much larger 
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scale. Analyzing the behavior of these individuals, such as their purchasing or technology 
adoption tendencies, requires statistical techniques that can handle both the scope and the 
complexity of the data. 



The social network aspect is one such complexity. Researchers once assumed that an 
individual's decision to purchase a product or adopt a technology is solely associated with 



their personal attributes, such as age, education, and income Kamakura and Russell (1989) 



Allenby and Rossi (1998), though this could be due both to a lack of social network data and 



a mechanism for handling it; indeed, recent developments have shown that their decisions are 



associated with the decisions of an individual's neighbors in their social networks Bernheim 



(1994 


); 


Manski 


(2000 


); 


Smith and LeSage 


(2004) 



where someone imitates the behavior of their friends, or an indication of latent homophily, 
in which some unobserved and shared trait drives both the tendency for two people to form 



a friendship and for each to adopt (Aral et al. 2009 Shalizi and Thomas, 2011); either so- 



cial property will increase the ability to predict a person's adoption behavior beyond their 
personal characteristics. 



Each of these produces outcomes that are correlated between members of the network 
who are connected. A popular approach to study this phenomenon is to use a model with 
explicit autocorrelation between individual outcomes, defined with a single network structure 
term. With the depth of data now available, an actor is very often observed to be a member 
of multiple distinct but overlapping networks, such as a friend network, a work colleague net- 
work, a family network, and so forth, and each of these networks may have some connection 
to the outcome of interest, so a model that condenses all networks into one relation will be in- 
sufficient. While models have been developed to include two or more network autocorrelation 



terms, such as Doreian (1989), these do not allow for the immediate and principled inclusion 
of binary outcomes; other methods to deal with binary outcomes on multiple networks, such 
as Yang and Allenby (2003), instead take a weighted average of other networks in the system, 
combining them into one, which has the side effect of constraining the sign of each network 
autocorrelation component to be identical, which may be undesirable if there are multiple 
effects thought to be in opposition to one another. 



To deal with these issues, we construct a model for binary outcomes that uses the probit 
framework, allowing us to represent these outcomes as if they are dichotomized outcomes 



from a multivariate Gaussian random variable; this is then presented as in Doreian (1989) to 
have multiple regimes of network autocorrelation. We first use the Expectation-Maximization 
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algorithm (EM) to find a maximum likelihood estimator for the model parameters, then use 
Markov Chain Monte Carlo, a method from Bayesian statistics, to develop an alternate es- 
timate based on the posterior mean. We also study the sensitivity of both solutions to the 
change of parameters' prior distribution. Preliminary experiments show that the E-M solu- 
tion to this model is degenerate, and cannot produce a usable variance-covariance matrix for 
parameter estimates, and so the MCMC method is preferred. Our software is also validated 



by using the posterior quantiles method of Cook et al. (2006). We ensure that the parameter 
estimates from the model are correct by testing first on simulated data, before moving on to 
real examples of network-correlated behavior. 



The rest of the paper is organized as follows. We discuss the literature on the network 
autocorrelation model in Section [2j Our two estimation algorithms for the multi-network 
autoprobit, based on EM and MCMC, are presented in Section [3} In Section [4] we present the 
results of experiments for software validation and parameter estimation behavior observation. 
Conclusions and suggestions for future work complete the paper in Section [5] 



2 Background 



[[Previously: Literature]] Network models of behavior are developed to study the process of 
social influence on the diffusion of a behavior, which is the process "by which an innovation 
is communicated through certain channels over time among the members of a social system 
... a special type of communication concerned with the spread of messages that are perceived 



as new ideas" Rogers (1962). These models have been widely used to study diffusion since 



the Bass ( 1969 ) model, a population-level approach that assumes that everyone in the social 



network has the same probability of interacting. Such assumption is not realistic because 
given a large social network, the probability of any random two nodes connecting to each other 
is not the same; for example, people with closer physical distance communicate more and 
are likely to exert greater influence on each other. A refinement to this approach is a model 
where the outcomes of neighboring individuals are explicitly linked, such as the simultaneous 



autoregressive model (SAR). The general method of SAR is described in Anselin (1988) and 



Cressie (1993); it considers simultaneous autoregression on the residuals of the form 



y = X/3 + 0, 6 = pW6 + e 



where y is a vector of observed outcomes, in this case consumer choice; X is a vector of 
explanatory variables. Rather than an independent error term, 6 represents error terms whose 
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correlation is specified by W, the social network matrix of interest, and p, the corresponding 
network autocorrelation, distributing a Gaussian error term e^. 



Maximum likelihood estimate solutions are provided by Ord (1975), Doreian (1980, 1982), 



and Smirnov (2005). 



Standard network autocorrelation models can only accommodate one network, such as 



those of Burt (1987) and Leenders (1997). However, an actor is very often under influence 
of multiple networks, such as that of friends and that of colleagues. So if a research requires 
investigation of which autocorrelation term out of multiple networks plays the most signifi- 
cant role in consumers' decision, none of these models are adequate, and a model that can 
accommodate two or more networks is necessary. 



Cohesion and structural equivalence are two competing social network models to explain 
diffusion of innovation. In the cohesion model, a focal person's adoption is influenced by 
his/her neighbors in the network. In the structural equivalence model, a focal person's 
adoption is influenced by the people who have the same position in the social network, 
such as sharing many common neighbors. While considerable work has been done on these 
models on real data, the question of which network model best explains diffusion has not been 



resolved. To approach this, Doreian (1989) introduced a model for "two regimes of network 
effects autocorrelation''^ for continuous outcomes. The model is described as below: 



y = X/3 + piWiy + p 2 W 2 y + e 



where y is the dependent variable; X is a vector of explanatory variables; each W represents 
a social structure underlying each autoregressive regime. This model takes both interde- 
pendence of actors and their attributes, such as demographics, into consideration; these 
interdependencies are each described by a weight matrix W*. Doreian's model can capture 
both actor's intrinsic opinion and influence from alters in his social network. 



As this model takes a continuous dependent variable, Fujimoto and Valente (2011 ) present 



a plausible solution for binary outcomes by directly inserting an autocorrelation term Wy 



■"■The term "network effects" can refer to two directly related concepts: the autocorrelation between indi- 
vidual behaviors on a network, and the increased impact of a technology to an individual when used by more 
people within a network. Our meaning is the first, though we use the term partial network autocorrelation 
to avoid ambiguity. 
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into the right hand side of a logistic regression: 



yt ~ Be(pi) 



Pi 



Due to its speed of implementation, this method is called "quick and dirty" (QAD) by Dor 



eian 



(1982). Although it may support a binary dependent variable and multiple network 



terms, this model does not satisfy the assumption of logistic regression - the observations are 



not conditionally independent, and the estimation results are biased. Thomas (2012) shows 
that this method has more consequences than expected for the estimation procedure beyond 
simple bias; for example, in cases where W is a directed graph, networks that are directional 
cannot be distinguished from their reversed counterparts. 



Yang and Allenby (2003) propose a hierarchical Bayesian autoregressive mixture model 
to analyze the effect of multiple network autocorrelation terms on a binary outcome. Their 
model can only technically accommodate one network effect, composed of several smaller 
networks that are weighted and added together. This model therefore assumes that all 
component network coefficients must have the same sigrj^J and also be statistically significant 
or insignificant together. Such assumptions do not hold if the effect of any but not all of 
the component networks is statistically insignificant, or of the opposite sign to the other 
networks, so a method that estimates coefficients for each W separately is necessary for our 
applications. We contrast our method with the Yang- Allenby grand W construction method, 



a finite mixture of coefficient matrices, in Appendix A. 5 



3 Method 

We propose a variant of the auto-probit model that accommodates multiple regimes of net- 
work autocorrelation terms for the same group of actors, which we call the multiple network 
auto-probit model (m-NAP). We then provide two methods to obtain estimates for our model. 
The first is the use of Expectation-Maximization, which employs a maximum likelihood ap- 
proach, and the second one is a Markov Chain Monte Carlo routine that treats the model as 
Bayesian. Detailed descriptions of both estimations are shown in Appendix A.l and A. 2 



2 It is of course possible to specify terms in the W matrix as negative, to represent anticorrelation on a 
tie, but this must be done a priori, and is redundant in our approach. 
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3.1 Model Specification 

The actors are assumed to have k different types of network connections between them, where 
Wj is the i th network in question % e {1, k}. y is the vector of length n of observed binary 
choices, and is an indicator function of the latent preference of consumers z. If z is larger 
than a threshold 0, consumers choose y as 1; if z is smaller than 0, then consumers would 
choose y as 0. 

y = I(z>0) 

z = X/3 + + e, e~ Normal„(0, I n ) 

k 

= ^ PiWfi + u, u ~ Normal n (0, a 2 I n ) 
i=i 

z is a function of both exogenous covariates X, autocorrelation term 6, and individual error. 
X is an n x m covariate matrixthat includes a constant as its first column; these covari- 
ates could be the exogenous characteristics of consumers. (3 is an m x 1 coefficient vector 
associated with X. 6 is the autocorrelation term, which is responsible for those nonzero 
covariances in the z. can be described as the aggregation of multiple network structure Wj 
and coefficient pj. Each Wj is a network structure describing connections and relationships 
among consumers. 

Our model explicitly allows multiple competing networks that can be defined by different 
mechanisms on an existing basis of network ties; for example, Wi describes an effect acting 
directly on a declared tie, such as homophily or social influence, whereas W 2 describes the 
structural equivalence due to those ties. It can also be that each Wj is defined by a different 
type of network edge, such as friendship, colleagueship, or mutual group membership; note 
that none of these relationships must be mutually exclusive. Each coefficient pi describes 
the effect size of its corresponding network Wj,so that we can compare the relative scales of 
competing network structures for the same group of actors embedded in social networks. 

The error term for the model is modeled as an augmented expression that consists of 
two parts, e and u. e is the unobservable error term of z that describes individual-level 
variation that is not shared on the network, and u is the error that is then distributed along 
each network, accounting for the non-zero covariance between units. If we marginalize this 
model by integrating out 6, all the unobserved interdependency will be isolated in a single 
expression for the distribution of z, given parameters /3, p and a 2 , as multivariate with mean 
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X/3 and variance Q. 



z ~ Normal (X/3, Q) 



where 



i 




i 



) 



T 



Q = I n + cr 



) 



) 



The non-standard form of the covariance matrix can therefore pose a significant computa- 
tional issue. 

3.2 Expectation-Maximization Solution 

We first develop an approach by maximizing the likelihood of the model using E-M. Since z 
is latent, we treat it as unobservable data, for which the E-M algorithm is one of the most 
used methods. Detailed description of our solution for k regimes of network autocorrelation 
is in Appendix |A.1[ 

The method consists of two steps: first, estimate the expected value of functions of the 
unobserved z given the current parameter set <fi, (0 = {/3, p, a 2 }). Second, use these esti- 
mates to form a complete data set {y, X, z}, with which we estimate a new cf> by maximizing 
the expectation of the likelihood of the complete data. 

We first initialize the parameters to be estimated, 



where i — 1, m, and j = 1, k. Let these values equal 4> ( ' . 

For the E-step, we calculate the conditional expectation of the log-likelihood, with respect 



Normal(z/g, VLp); 
Normal (z/ p , Q p ); 
Gamma (a, b) 
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to the augmented data, 

G(0|0«) = E z||yi0(t) [logL(0|z,y)] 

-. n n 

= -| log 2tt - | log | Q | -- ^ Yl %-( E [^l " E N^i/3 - Efe]X,/3 + 

i=i i=i 

where t is the current step number and is element in the matrix Q _1 . 

In the M-step, we maximize G(<p I 4> ) to get /3* +1 , p <+1 and [cr 2 ](' +1 ) for the next step. 

0(t+i) = argmax G (/3 | p« [<r 2 ] W ); 

= argmax G(p | /3 (m) , [<r 2 ] W ); 
p 

[a 2 f +1) = argmax G([<r 2 ] | fi t+1 \ p^) 
We replace 0® with ( - t+1 ) and repeat the E-step and M-step until all the parameters con- 



verge. Parameter estimates from the E-M algorithm converge to the MLE estimates Wu 



(1983). 



It is worth noting that the analytical solution for all the parameters is not always possible. 
Consider the maximization with respect to the autocorrelation variance parameter a 2 : 

[a 2 f+V = argmax G(0 | W ) 

^^(4.o g |Ql4(z-X m -X« 
The first term at the the right hand side of Equation Q is: 



log | Q | = — ^— log 



d[a 2 ] ° 1 d[a 2 



i=l I \ \ i=l 
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The second term is: 



d[a 







(z - X/3)~ 



V 1 



(z - X/3) 



/ 



This is not solvable analytically, and numerical methods are needed to get the estimators for 
this parameter and for p. 

As it happens, the E-M algorithm produces a degenerate solution. This is because it 
estimates the mode of a 2 , the error term of the autocorrelation term 6, which is at (see 
Figure [T]), and produces a singular variance-covariance matrix estimate using the Hessian 
approximation. Thus we have to find another solution. 



Distribution of c 



Figure 1: An estimated probability distribution for a 2 , variance of 6. Maximum likelihood 
methods, such as the Expectation-Maximization method, will choose a 2 = 0, a degenerate 
solution. 



3.3 Full Bayesian Solution 

We turn to Bayesian methods. Since the observed choice of consumer's is decided by his/her 
unobserved preference, this model has a hierarchical structure, so it is natural to think of 



Table 1: Cyclical conditional sampling steps for Markov Chain Monte Carlo 



Parameter 


Density 


Draw Type 


z 


TrunNormal n (X/3 + 0, Q 


Parallel 


P 


Normaln^, ftp) 


Parallel 


e 


Normal n (i/ e , Q g ) 


Parallel 


a 2 


InvGamma(a, b) 


Single 


Pi 


Metropolis step 


Sequential 



using a hierarchical Bayesian method. In addition to the model specification above, prior 
distributions for each of the highest-level parameters in the model are also required. As 
before, y is the observed dichotomous choice and calculated by the latent preference z. With 
Markov Chain Monte Carlo, we generate draws from a series of full conditional probability 
distributions, derived from the joint distribution. We summarize the forms of the full con- 
ditional distributions of all the parameters to estimate in Table[TJ and in full in Appendix |A.2 



Given the observed choice of consumer, the latent variable z is generated from a truncated 
normal distribution with a mean of X/3 + 6 with unit error. The prior distributions of the 
parameters (shown in Table [I] are adapted from priors proposed by Smith and LeSage (2004): 



• (3 follows a multivariate normal distribution with mean up and variance Qp. 

• a 2 follows an inverse gamma distribution with parameters a and b. 

• Each pi follows a normal distribution with mean v p and variance fl p . 

The sampler algorithm was constructed in the R programming language, including a 
mechanism to generate data from the model. Validation of the algorithm was conducted 



using the method of posterior quantiles (Cook et al. , 2006), ensuring the correctness of the 



code for all analyses. Posterior quantiles is a simulation-based method that generates data 
from the model and verifies that the software can generate parameter estimate randomly 
around true parameter. For detailed description of the implementation, please see Appendix 



3.4 Sensitivity to Prior Specification 

We test the performance of the sampler using prior distributions that are closer to our 
chosen model than the trivial priors used to check the model code in order to assess the 
behavior of the algorithm under non-ideal conditions. We demonstrate on data simulated 



from the model, using two pre-existing network configurations, and specify different prior 
distributions for each parameter. To demonstrate, we choose a prior distribution for p\ with 
high variance, p ~ Normal(0, 100), . As shown in Figure 2(a) , the posterior draws of p\ have 
high temporal autocorrelation. To compare, we choose a narrow prior distribution for pi, pi ~ 
Normal(0.05, 0.05 2 ); the posterior draws for p\ are shown in Figure 2(b), and the temporal 
autocorrelation is considerably smaller. With the volume of data under consideration, it is 
clear that the posterior distribution of p is sensitive to its prior distribution. 



True p, = 0.092 



True p, = 0.092 




i 1 1 1 r 

200 400 600 800 1000 
Experiment index 

(a) p ~ Normal(0, 100) 



200 



40C 



500 



800 



"TOO 



Experiment index 



(b) p ~ Normal(0.05, 0.05 2 ) 



Figure 2: Testing the sensitivity of the inference of an autocorrelation parameter p\ to the 
prior distribution, (a) The Markov Chain for a weakly informative prior distribution is 
consistent with the "oracle" value pi, but the chain has significant temporal autocorrelation, 
(b) The Markov Chain with a strongly informative prior distribution has much less temporal 
autocorrelation, but is beholden to its prior distribution more than the data. 



In most of our examples, we do not have a great deal of prior information available on any 
network parameters, suggesting that most of our analyses will be conducted with minimally 
informative prior distributions. With such high autocorrelation between sequential draws, 
the effective sample size is extremely small. We therefore use a high degree of thinning to 
produce a series of uncorrelated draws from the posterior. 
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4 Applications 



4.1 Auto Purchase Data of Yang and Allenby (2003) 



We use | Yang and Allenbyf s |2003| Japanese car data to compare the findings of our method 
with those in the original study. The data consists of information on 857 purchase decisions 
of mid-size cars; the dependent variable is whether the car purchased was Japanese (y m = 1) 
or otherwise (y m = 0). All the car models in the data are substitutable and have roughly 
similar prices. 

An important question of interest is whether the preferences of Japanese car among 
consumers are interdependent or not. The interdependence in the network is measured by 
geographical location, where Wij = 1, if consumer i and j live in the same zip code, and 
0, otherwise. Explanatory variables include actors' demographic information such as age, 
annual household income, ethnic group, education and other information such as the price of 
the car, whether the optional accessories are purchased for the car, latitude and longitude of 
the actor's location. To construct a network, Yang and Allenby use whether the consumers' 
home address in the same zip code as the indicator of a connection. Thus the network struc- 
ture W, the cohesion, is the joint membership of same geographic area. 

By comparing the parameters of Yang and Allenby's model to those for m-NAP on the 
same dataset, with the same underlying definition of network structure, we contrast our 
approaches and demonstrate the value of separating the impact of various network auto- 
correlations. The comparison of the coefficient estimates from Yang and Allenby and our 
Bayesian solution is shown in Figure [3] , for both explanatory variables and for network auto- 
correlations. We specify a second network term W 2 to be the structural equivalence of two 
consumers, calculated as the simple adjacency distance between the two vectors representing 
individuals' connections to other individuals in the network to measure structural equiva- 
lence. In a undirected network with non-weighted edges the adjacency distance between 
two nodes i and j is the number of individuals who have different relationships to i and j 
respectively, 



N 



x (A ik -A jk )*, (2) 

\ k=l,k^=i,j 



where A ik = 1 if node i and k are neighbors, and otherwise. The larger d between node 
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i and j, the less structurally equivalent they are. We use the inverse of dij plus one in or- 
der to construct a measure with a positive, finite relationship with role equivalence, so that 
Sij = d 1 +1 . In our setting, a random element Aij in Equation (2) is from matrix Wi, so d 



is the adjacency distance between any two vectors Aj and Aj, representing consumer z's con- 
nections, and consumer j's connections to all the other consumers in the data, respectively. 
The inverse of d^ with an addition to 1 (to avoid zero as denominator), Sy, becomes element 
of structural equivalence matrix W2. 

The comparison is shown in Figure [3} Each box contains the estimates of one parameter 
from three methods: from left to right, Yang and Allenby, NAP with 1 network, and NAP 
with 2 networks. All the coefficient estimates, p 2 , and a 1 of the three methods have 
similar mean, standard deviation and credible interval. One thing interesting here is the ef- 
fect size of the second network, structural equivalence, has a significant negative effect. This 
suggests a diminishing cluster effect; when the number of people in the cluster gets bigger, 
the influence does not increase proportionally. 



4.2 Caller Ring-Back Tone Usage In A Mobile Network 

We use m-NAP to investigate the purchase of Caller Ring Back Tones (CRBT) within a cel- 
lular phone network, a technology of increasing interest around the world. When someone 
calls the subscriber of a CRBT, the caller does not hear the standard ring-back tone but 
instead hears a song, joke or other message chosen by the subscriber until the subscriber 
answers the phone or the mailbox takes over. As soon as a CRBT is downloaded, it is set 
as the default ring back tone, and triggered automatically by all phone call. Our data were 
obtained from a large Indian telecommunications company (source and raw data confiden- 
tial). We have cellular phone call records and CRBT purchase records over a three-month 
period, and phone account holders' demographic information such as age and gender. We 
extract a community of 597 users that are highly internally connected from a population 
with approximately 26 million unique users using the Transitive Clustering and Pruning (T- 
CLAP) algorithm (Zhang et al. 2011). Within this cluster, network edges are specified 
between users who call each other during the period of observation, as mutual symmetric 
connection implies equal and stable relationships (Hanneman and Riddle, 2005), rather than 
weaker relationships or calls related to businesses (inquiries or telemarketers). 



We include several explanatory variables in this model: 
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Figure 3: A comparison of coefficient estimates between the Yang-Allenby method and m- 
NAP with f or 2 networks. The models give similar results, while noting that there is now a 
negative and statistically significant effect on the network representing structural equivalence. 
f3 : coefficient of constant term, j3i. coefficient of X 1; car price; /3 2 : coefficient of X 2 , car's 
optional accessory; /3 3 : coefficient of X 3 , consumer's age; f5±: coefficient of X 4 , consumer's 
income; (3^. coefficient of X 5 , consumer's ethnicity; (3q: coefficient of X 6 , residence longitude; 
(3j: coefficient of X7, residence latitude; p\. coefficient of first network autocorrelation term, 
Wi, cohesion; p 2 : coefficient of the second network autocorrelation term, W 2 , structural 
equivalence; a 2 : estimated variance of the error term in autocorrelation. 
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• The gender of the cellular phone account holder; 

• The age of the account holder; 

• The number of unique outbound connections from the user (known as the "outdegree"). 

From our original network, we derive two matrices corresponding to cohesion and struc- 
tural equivalence. Cohesion assumes callers who make phone calls to each other will hear 
the called party's CRBT thus more likely to buy that ring-back tone or get interested in 
CRBT and eventually adopt the technology. Since the number of people a caller calls are 
drastically different, we normalize the cohesion matrix by dividing each row by the total 
number of adopters, to make the matrix element to be the percentage of adoption. Struc- 
tural equivalence is once again defined as the adjacency distance between two callers. Here 
it is less clear that there is an obvious mechanism for how structural equivalence can impact 
adoption, as it relates to a relationship that does not expose the caller to the CRBT. 



E(pp) = 0.0285 E(h)-D117 E(fi ; ) = 00345 Et> 3 ) = 0.000935 
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Figure 4: Trace plot of CRBT network parameters. Description of parameters: /3q: coefficient 
of constant term; /3i: coefficient of consumer's gender; fa- coefficient of consumer's age; P3: 
coefficient of number of called contacts; p\. coefficient of first network autocorrelation term, 
Wi, cohesion; p 2 : coefficient of the second network autocorrelation term, W 2 , structural 
equivalence; a 2 : estimated variance of the error term in autocorrelation; loglike: log-likelihood 
of y. 



We show estimates for each parameter of the model is shown in Figure 4.2 Again, we 
observe a significant negative effect for structural equivalence. This new network autocorre- 
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lation, with a coefficient of opposite sign from that of the first network autocorrelation Wi, 
cannot be identified by any earlier models. 

5 Conclusion 

We have introduced a new auto-probit model to study binary choice of a group of actors 
that have multiple network relationships among them. We specified the fitting of the model 
for both E-M and hierarchical Bayesian methods. We found that the E-M solution cannot 
estimate the parameters for this particular model, thus only hierarchical Bayesian solution 
can be used here. We also validated our Bayesian solution by using the posterior quantiles 
method and the results show our software returns accurate estimates. Finally we compare 
the estimates returned by Yang and Allenby, NAP with one network effect (cohesion), and 
NAP with two network effects (cohesion and structural equivalence), by using real data. 

We want to ensure that the approach can recover variability in the network effect size. 
Assuming W0 has strong effect, we will vary p's true value from small number to large num- 
ber, and observe whether our solution can capture the variation. 

Finally we also want to study how multicollinearities between Xs, and between X and 
W6 affect estimated results. 
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APPENDIX 



A.l E-M solution implementation 
A. 1.1 Deduction 

First, get the distribution of 6. 



8=1 

-1 



0=[l n -Y,PiWi) u 



1=1 



6 ~ Normal 



V 



i=l / \ \ i=l 



Then get the distribution of z\/3, p, a 2 : 

z ~ Normal (X/3, Q) , where Q = I n + a 2 ( J n - £ p.W, ) I ( J n - ^ p,W, 



-1 / / t x -1 N 



i=l / \ \ i=l 



The joint distribution of y and z can transformed as: 

p(y|z)p(z|/3, p, ct 2 ) = p(y, z|/3, p, cx 2 ) 

= p(z|y;/3,p,^ 2 )p(y) 

The right side of equation (|3| are two distributions we already have, as shown below. 

-Lexp (-i(z-X/3) T (z-X/3)) 

P(y) = V $(Xj3) ^ > 0) 

z|/3, p, a 2 ~ Normal(X/3, Q) 
z|y, X; (3, p, cr 2 ~ TrunNormal(X/3, Q) 



Consider parameter (3 only, 



K/3,z|y) =p(/3l z >y)p( z ly) 

z|y, X; /3 ~ TrunNormal(X/3, Q) 
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Assume Var(z)=l, 



V \ / 

3 = (X T X)- 1 X T R, where R = E[z|0,y] 



Then include parameters, p and a 2 . 

E[z](*+ 1 ) = ENy,i9W] = /09W,y) 
logL(/3,p,a 2 |z) = logp(z|/3, p,<r 2 ) 

n 

= log JJ^Zil^P,^ 2 ) 

8=1 



n 

£ log -^J= - (VcT'z - z T Q-'X/3 - X T /3Q-'z + X T /3Q-'X/3^) 



If decompose the matrices above as vector product, then: 

n ^ n n 



n n 



= E log "T^i ~ o E E - ZiXjP - z i x iP + XiXj/3 2 

i=i V Zn M z i= i j= i 

where is the element in Q, and Q = Q _1 . 
A. 1.2 Expectation step 

In the expectation step, get the expected log-likelihood of parameters. 



(4) 



Q(4>\4> 



(^)^ 



E z|y,0W [ lo g^(0l Z 5 y)] 
1 



E 



E lo e 



8=1 



E 



(z-X^'Q-^z-X/?) 



i n n 

log 2vr - £ log |Q| - - E E ?v( E fe*i] - E N*i£ - E[ Zj ]Xil3 + X t X,(3 2 



i=i j=i 
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where (f> is the parameter set, and t is the number of steps. 
A. 1.3 Maximization step 

In the maximization step, get the parameter estimates maximizing the expected log-likelihood. 
First, estimate (3 

pit+V = argmax Q(0|0 W ) 

n 

= argmax V log -== - -(z - X/3) T Q" 1 (z - X/3) (5) 

z 3 7^ v 27r IQI 2 



If directly apply analytical method to solve the Equation ^ above, then: 

-(z-X/3) T Q- 1 (z-X/3) 



d log L d ( \ t v rj , T | , 



9/3 9/3 V 2 

— (z - X/3) T Q- 1 (z - X/3) = — {^Qr^ - z T Q 1 X/3 - /3 T X T Q- X z + /3 T X T Q- x X/3) 

= -z T Q X X-X T Q 1 z + X T Q x X/3 (6) 

Set Equation (|6| as 0, then: 

-z T Q- 1 X - X T Q- 1 z + X T Q- 1 X/3 = 

/3 = (X T Q- 1 X) _1 X T Q- 1 R 

Second, estimate parameter p: 

p (m) = argmax QO|0 (i) ) 

Assume p = {pi, ...,pfc}, without losing any generalizabiliy, p\ can be estimated as: 

pf +1) = argmax Q(0|0 (<) ) 

pi 



<9l0gL 9 ^ 1, _|^| 1, VrJi -r h 



log|Q|--(z-X/3) l Q- 1 (z-X/3) 
api api \ 2 2 

<7 log | Q | = -trtWxCT 1 ) 



<9p 



— (z - X^TQ-^z - X/3) = — (z T Q- 1 z - z T Q- 1 X/3 - ^XTQ^z + /3 T X T Q" 1 X/3) 

dpi dp! 
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It is impossible to get the analytical solution for pi. 

Third, estimate parameter a 2 . Let a 2 = [a 2 ] 

[a 2 f+V = axg max Q(<£| 
dlogL d ( 1 1 - 



d[cr 2 ] d[cr 2 ] V 2 2 
The first term at the the right hand side of equation above is 



-- log |Q| - -(z - X/3) 1 Q- A (z - X/3) (7) 



9 i mi 9 i 

■log Q = 7TT^T lo g 



<9[<x 2 ] d[a 2 

The second term is: 

(z-X/3) T Q- 1 (z-X/3) 



4 + [^ 2 ] 4 - J> w * ( j n - X> w * 



i=l / \ \ i=l 



9[(T 



d 

(z-X/3) 



-1 



V 



1=1 / \ \ 1=1 



<% 2 

This is again not solvable by using analytical method. 



A. 2 Markov Chain Monte Carlo estimation 

The Markov Chain Monte Carlo method generates a sequence of draws that approaches the 
posterior distribution of interest. Our solution consists of steps as follows. 

Step 1. Generate z, z follows truncated normal distribution. 

z ~ TrunNormal n (X/3 + 0, I n ) 
where I n is the n x n identity matrix. If = 1, then zi > 0, if y,i = 0, then Z{ < 

Step 2. Generate /3, (3 ~ NormaltV^, f2 / g) 
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1. define (3 , where 










2. define D = hl n , D is a baseline variance matrix, corresponding to the prior p(f3), where 
h is a large constant, e.g. 400. 



D 1 = 



ol 







al ... 



... ol 



Set <Tq as a small number close to 0, compared with Normal(0, 1), where <Tq = 1 

3. = (D^ + XTX) -1 
This is because: 



z = X/3 + + e 
/3 = X _1 (z - - e) 



.-./3~ Normal (X^z - 0), (X^)- 1 ) 

Based on law of initial values, Qp = (D _1 + X T X) 



-i 



4. Then i/^ can be represented by vp = ftp (X T (z — 0) + D _1 ) 
Step 3. Generate 0, ~ Normal^, fie) 
I. First, define B = I n — ^p^Wj 

= J>W, + u 
(/„ - J>Wi)0 = u 

i 

B0 = u 
= B x u 
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Let Var(u) = a 2 L, 



Var(0) = Vax(B _1 u) 

= (B T B)-V 2 /„ 
B T B 



a 2 

B T B\ _1 B T B ( bbN 

2. Then fie = In H tt~ We then add an offset I n to — So fie = ^~ 

(7 / <7 2 \ (J 



3. v e = Q g (z - X/3), since = (z - X/3) - e 
Step 4. Generate a 2 , a 2 ~ InvGamma(a, 6) 



n 

a = s + - 
b= : 



T B T B0+ — 



where so and go are the parameters for the conjugate prior of a 2 , and n is the size of data. 

Step 5. Finally we generate coefficient for W, p i: using Metropolis-Hasting sampling with 
a random walk chain. 



pT w = pT + Aj, 

where the increment random variable Aj ~ Normal (z/a, £)a) 
The accepting probability a is obtained by: 



|exp 1—^—^0 B neu ,B ne?i ,0 J ' ! 

min -. 

|B oM | exp I --^i Bj M B oM 

A. 3 Validation of Bayesian Software 

One challenge of Bayesian methods is getting an error-free implementation. Bayesian solu- 
tions often have high complexity, and a lack of software causes many researchers to develop 
their own, greatly increasing the chance of software error; many models are not validated, 
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and many of them have errors and do not return correct estimations. So it is very necessary 
to confirm that the code returns correct results. The validation of Bayesian software imple- 
mentations has a short history; we wrote a program using a standard method, the method 



of posterior quantiles Cook et al. (2006), to validate our software. This method again is 
a simulation-based method. The idea is to generate data from the model and verify that 
the software will properly recover the underlying parameters in a principled way. First, we 
draw the parameters 9 from its prior distribution p(Q), then generate data from distribution 
p(y | 9). If the software is correctly coded, the quantiles of each true parameter should be 
uniformly distributed with respect to the algorithm output. For example, the 95% credible 
interval should contain the true parameter with probability 95%. Assume we want to es- 
timate the parameter 9 in Bayesian model p{9 \ y) = p(y \ 9)p(9), where p(9) is the prior 
distribution of 9, p(y \ 9) is the distribution of data, and p{9 \ y) is the posterior distribution. 
The estimated quantile can be defined as: 



1 N 

q(9 ) = P(9 < 9 ) = < Oo) 



i=l 

where 9q is the true value drawn from prior distribution; 9 is a series of draw from posterior 
distribution generated by the software to-be-tested; N is the number of draws in M CM C. The 
quantile is the probability of posterior sample smaller than the true value, and the estimated 
quantile is the number of posterior draws generated by software smaller than the true value. 
If the software is correctly coded, then the quantile distribution for parameter 9, q(9 ) should 



approaches Uniform(0, 1), when N — > oo Cook et al. (2006). The whole process up to now is 



defined as one replication. If run a number of replications, we expect to observe a uniformly 
distribution q(9o) around 9$, meaning posterior should be randomly distributed around the 
true value. 

We then demonstrate the simulations we ran. Assume the model we want to estimate is: 

z = Xift + X 2 /3 2 + + e; 
= Pl W!0 + p 2 W 2 + u 

We then specified a prior distribution for each parameter, and use MCMC to simulate the 
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posterior distributions. 



f3 ~ Normal(0,l); 
cr 2 ~ InvGamma(5, 10); 
p ~ Normal(0.05,0.05 2 ) 

We performed a simulation of 10 replications to validate our hierarchical Bayesian MCMC 
software. The generated sample size for X is 50, so the size of the network structure W is 
50 by 50. In each replication we generated 20000 draws from the posterior distribution of 
all the parameters in (f) ((f) = {(3i, fa, P\,Pi, c 2 }), and kept one from every 20 draws, yielding 
1000 draws for each parameter. We then count the number of draws larger than the true 
parameters in each replication. If the software is correctly written, each estimated value 
should be randomly distributed around the true value, so the number of estimates larger 
than the true value should be uniformly distributed among the 10 replications. We pooled 
all these quantiles for the five parameters, 50 in total, and the sorted results are shown in 
Figure [5] 




10 20 30 40 50 

Exp No 



Figure 5: Distribution of sorted quantiles of parameters, fa, fa, pi, p 2 , cr 2 , over 10 replications. 
The roughly uniform distribution indicates that the algorithm code functions correctly for 
data simulated from the model. 
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A. 4 Solution diagnostic 

We run MCMC experiment to confirm there is no autocorrelation among draws of each 
parameter. In this experiment, we set the length of MCMC chain as 30,000, burn-in as 
10,000, and thinning as 20, which is used for removing the autocorrelations between draws. 
The trace plots generated from our code for the 1000 draws after burn-in and thinning are 
listed in the Figure [6] below. 



E(p ) = -20.5 Efjj) = -0.0883 E(fe) = 0.0372 E(p s ) = -0.246 
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Figure 6: Trace plot of a two-network auto-probit model, flo'- coefficient of constant term, 
0i\ coefficient of car price; coefficient of car's optional accessory; f3 3 : coefficient of 
consumer's age; (3^. coefficient of consumer's income; (3$: coefficient of consumer's ethnicity; 
/3 6 : coefficient of residence longitude; (3 7 : coefficient of residence latitude; p\\ coefficient 
of first network autocorrelation term, Wi, cohesion; p2. coefficient of the second network 
autocorrelation term, W2, structural equivalence; a 2 : estimated variance of the error term 
in autocorrelation. 

We have 12 plots total. Each plot depicts draws for a particular parameter estimation. 
The first 9 plots, from left to right and top to bottom, are the trace for the Pi, coefficient of 
independent variables. Each point represents the value of estimated coefficient and the 
red line represents the mean. We observe all fys are randomly distributed around the mean, 
and the mean is significant, showing the estimation results are valid. The 10th and 11th 
plots are for the two estimated network effect coefficients p\ and p\. We found both pi are 



also significant, and randomly distributed around their means. The only coefficient showing 
autocorrelation is a 2 . 

Note that not all values of p\ and pi can make B (B = I n — p{W\ — P2W2) invertible. The 
plot below shows the relationship between the values of p\ and P2, and the invertibility of B. 
The green area is where B is invertible, and red area is otherwise. If limit draws to the green 
area, we will have correlated p\ and pi- When we draw p\ and P2 using bivariate normal, 
there is no apparent correlation between them (see Figure [7]). We understand the correlation 
between pi and p 2 comes from the definition of Wi and W 2 , not the prior non-correlation. 




Figure 7: Regions of validity for pi and p2 for which B is invertible (green) or not (red). 
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A. 5 W as a mixture of matrices 



Yang and Allenby 2003 ) specified the autoregressive matrix W as a finite mixture of coefficient 
matrices, each related to a specific covariate: 

n 

1=1 

n 

5> = 1 



i=i 



where % represents the indices of the covariates, i = 1... n. 0j is the correspondent weight of 
the component matrix Wj. W, is associated with a covariate Xj. 
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